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Abstract 

We introduce a new distance-based phylogeny reconstruction technique 
which provably achieves, at sufficiently short branch lengths, a logarithmic 
sequence-length requirement — improving significantly over previous poly- 
nomial bounds for distance-based methods and matching existing results for 
general methods. The technique is based on an averaging procedure that 
implicitly reconstructs ancestral sequences. 

In the same token, we extend previous results on phase transitions in phy- 
logeny reconstruction to general time-reversible models. More precisely, we 
show that in the so-called Kesten-Stigum zone (roughly, a region of the pa- 
rameter space where ancestral sequences are well approximated by "linear 
combinations" of the observed sequences) sequences of length 0(log n) suf- 
fice for reconstruction when branch lengths are discretized. Here n is the 
number of extant species. 

Our results challenge, to some extent, the conventional wisdom that es- 
timates of evolutionary distances alone carry significantly less information 
about phylogenies than full sequence datasets. 

Keywords: Phylogenetics, distance-based methods, phase transitions, recon- 
struction problem. 
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1 Introduction 



The evolutionary history of a group of organisms is generally represented by a 
phylogenetic tree or phylogeny [Fel04, SS03]. The leaves of the tree represent the 
current species. Each branching indicates a speciation event. Many of the most 
popular techniques for reconstructing phylogenies from molecular data, e.g. UP- 
GMA, Neighbor- Joining, and BIO-NJ [SS63, SN87, Gas97], are examples of what 
are known as distance-matrix methods. The main advantage of these methods is 
their speed, which stems from a straightforward approach: 1) the estimation of a 
distance matrix from observed molecular sequences; and 2) the repeated agglom- 
eration of the closest clusters of species. Each entry of the distance matrix is an 
estimate of the evolutionary distance between the corresponding pair of species, 
that is, roughly the time elapsed since their most recent common ancestor. This 
estimate is typically obtained by comparing aligned homologous DNA sequences 
extracted from the extant species — the basic insight being, the closer the species, 
the more similar their sequences. Most distance methods run in time polynomial 
in n, the number of leaves, and in k, the sequence length. This performance com- 
pares very favorably to that of the other two main classes of reconstruction meth- 
ods, likelihood and parsimony methods, which are known to be computationally 
intractable [GF82, DS86, Day87, MV05, CT06, Roc06]. 

The question we address in this paper is the following: Is there a price to 
pay for this speed and simplicity? There are strong combinatorial [SHP88] and 
statistical [Fel04] reasons to believe that distance methods are not as accurate as 
more elaborate reconstruction techniques, notably maximum likelihood estimation 
(MLE). Indeed, in a typical instance of the phylogenetic reconstruction problem, 
we are given aligned DNA sequences {(Ci)i=i}i€L, one sequence for each leaf 
I € L, from which we seek to infer the phylogeny on L. Generally, all sites 
(Ci)ieL, for i = 1, . . . , k, are assumed to be independent and identically distributed 
according to a Markov model on a tree (see Section 1.1). For a subset W C L, 
we denote by fiyv the distribution of (Ci)ieW under this model. Through their 
use of the distance matrix, distance methods reduce the data to pairwise sequence 
correlations, that is, they only use estimates of fi2 = {^w '■ W Q L, \W\ = 2}. 
In doing so, they seemingly fail to take into account more subtle patterns in the data 
involving three or more species at a time. In contrast, MLE for example outputs 
a model that maximizes the joint probability of all observed sequences. We call 
methods that explicitly use the full dataset, such as MLE, holistic methods. 

It is important to note that the issue is not one of consistency: when the se- 
quence length tends to infinity, the estimate provided by distance methods — just 
like MLE — typically converges to the correct phylogeny. In particular, under mild 
assumptions, it suffices to know the pairwise site distributions jj.2 to recover the 
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topology of the phylogeny [CH91, Cha96]. Rather the question is: how fast is 
this convergence? Or more precisely, how should k scale as a function of n to 
guarantee a correct reconstruction with high probability? And are distance meth- 
ods significantly slower to converge than holistic methods? Although we do not 
give a complete answer to these questions of practical interest here, we do provide 
strong evidence that some of the suspicions against distance methods are based on 
a simplistic view of the distance matrix. In particular, we open up the surprising 
possibility that distance methods actually exhibit optimal convergence rates. 

Context. It is well-known that some of the most popular distance-matrix meth- 
ods actually suffer from a prohibitive sequence-length requirement [Att99, LC06]. 
Nevertheless, over the past decade, much progress has been made in the design 
of fast-converging distance-matrix techniques, starting with the seminal work of 
Erdos et al. [ESSW99a]. The key insight behind the algorithm in [ESSW99a], of- 
ten dubbed the Short Quartet Method (SQM), is that it discards long evolutionary 
distances, which are known to be statistically unreliable. The algorithm works by 
first building subtrees of small diameter and, in a second stage, putting the pieces 
back together. The SQM algorithm runs in polynomial time and guarantees the 
correct reconstruction with high probability of any phylogeny (modulo reasonable 
assumptions) when k = poly(n). This is currently the best known convergence rate 
for distance methods. (See also [DMR06, DHJ+06, Mos07, GMS08, DMR09] for 
faster-converging algorithms involving partial reconstruction of the phylogeny.) 

Although little is known about the sequence-length requirement of MLE [SS99, 
SS02], recent results of Mossel [Mos04], Daskalakis et al. [DMR06, DMR11], and 
Mihaescu et al. [MHR09] on a conjecture of Steel [SteOl] indicate that conver- 
gence rates as low as k = O(logn) can be achieved when the branch lengths are 
sufficiently short, using insights from statistical physics. We briefly describe these 
results. 

As mentioned above, the classical model of DNA sequence evolution is a 
Markov model on a tree that is closely related to stochastic models used to study 
particle systems [Lig85, Geo88]. This type of model undergoes a phase transi- 
tion that has been extensively studied in probability theory and statistical physics: 
at short branch lengths (in the binary symmetric case, up to 15% divergence per 
edge), in what is called the reconstruction phase, good estimates of the ances- 
tral sequences can be obtained from the observed sequences; on the other hand, 
outside the reconstruction phase, very little information about ancestral states dif- 
fuses to the leaves. See e.g. [EKPSOO] and references therein. The new algorithms 
in [Mos04, DMR06, DMR11, MHR09] exploit this phenomenon by alternately 1) 
reconstructing a few levels of the tree using distance-matrix techniques and 2) es- 
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timating distances between internal nodes by reconstructing ancestral sequences 
at the newly uncovered nodes. The overall algorithm is not distance -based, how- 
ever, as the ancestral sequence reconstruction is performed using a complex func- 
tion of the observed sequences named recursive majority. The rate k = O(logn) 
achieved by these algorithms is known to be necessary in general. Moreover, the 
slower rate k = poly(n) is in fact necessary for all methods — distance -based or 
holistic — outside the reconstruction phase [Mos03]. In particular, note that dis- 
tance methods are in some sense "optimal" outside the reconstruction phase by the 
results of [ESSW99a]. 

Beyond the oracle view of the distance matrix. It is an outstanding open prob- 
lem to determine whether distance methods can achieve k = O(logn) in the re- 
construction phase 1 . From previous work on fast-converging distance methods, it 
is tempting to conjecture that k = poly(n) is the best one can hope for. Indeed, 
all previous algorithms use the following "oracle view" of the distance matrix, as 
formalized by King et al. [KZZ03] and Mossel [Mos07]. As mentioned above, the 
reliability of distance estimates depends on the true evolutionary distances. From 
standard concentration inequalities, it follows that if leaves a and b are at distance 
r(a, b), then the usual distance estimate f (a, b) (see Section 1.1) satisfies: 

if r(a, b) < D + e or f (a, b) < D + e then \r(a, b) - f (a, 6)| < e, (1) 

for e, D such that k oc (1 — e~ £ )~ 2 e 2D . Fix e > small and k <C poly(n). Let 
T be a complete binary tree with log 2 n levels. Imagine that the distance matrix 
is given by the following oracle: on input a pair of leaves (a, b) the oracle returns 
an estimate f (a, b) which satisfies (1). Now, notice that for any tree T' which is 
identical to T on the first log 2 n/2 levels above the leaves, the oracle is allowed to 
return the same distance estimate as for T. That is, we cannot distinguish T and T' 
in this model unless k = poly(n). (This argument can be made more formal along 
the lines of [KZZ03].) 

What the oracle model ignores is that, under the assumption that the sequences 
are generated by a Markov model of evolution, the distance estimates 

(r(o,6)) 0)6e[n ] 

are in fact correlated random variables. More concretely, for leaves a, b, c, d, note 
that the joint distribution of (f(a, b), f (c, d)) depends in a nontrivial way on the 
joint site distribution fiyv at W = {a, b, c, d}. In other words, even though the 
distance matrix is — seemingly — only an estimate of the pairwise correlations /i2, 

'Mike Steel offers a 100$ reward for the solution of this problem. 



4 



it actually contains some information about all joint distributions. Note however 
that it is not immediately clear how to exploit this extra information or even how 
useful it could be. 

As it turns out, the correlation structure of the distance matrix is in fact very 
informative at short branch lengths. More precisely, we introduce in this paper a 
new distance-based method with a convergence rate of k = O(logn) in the re- 
construction phase (to be more accurate, in the so-called Kesten-Stigum phase; see 
below) — improving significantly over previous poly(n) results. Note that the or- 
acle model allows only the reconstruction of a o(l) fraction of the levels in that 
case. Our new algorithm involves a distance averaging procedure that implicitly 
reconstructs ancestral sequences, thereby taking advantage of the phase transition 
discussed above. We also obtain the first results on Steel's conjecture beyond the 
simple symmetric models studied by Daskalakis et al. [DMR06, DMR1 1, MHR09] 
(the so-called CFN and Jukes-Cantor models). In the next subsections, we in- 
troduce general definitions and state our results more formally. We also give an 
overview of the proof. 

Further related work. For further related work on efficient phylogenetic tree 
reconstruction, see [ESSW99b, HNW99, CK01, Csu02]. 

1.1 Definitions 

Phylogenies. We define phylogenies and evolutionary distances more formally. 

Definition 1 (Phytogeny) A phylogeny is a rooted, edge-weighted, leaf-labeled 
tree T = (V, E, [n], p; r) where: V is the set of vertices; E is the set of edges; 
L = [n] = {0, . . . , n — 1} is the set of leaves; p is the root; r : E — > (0, +oo) is a 
positive edge weight function. We further assume that all internal nodes in T have 
degree 3 except for the root p which has degree 2. We let Y n be the set of all such 
phylogenies on n leaves and we denote Y = {Y n } n >i. 

Definition 2 (Tree Metric) For two leaves a, b G [n\, we denote by Path(a, b) the 
set of edges on the unique path between a and b. A tree metric on a set [n] is a 
positive function d : [n] x [n] — > (0, +oo) such that there exists a tree T = (V, E) 
with leaf set [n] and an edge weight function w : E — >■ (0,+oo) satisfying the 
following: for all leaves a, b G [n] 




eePath(a,b) 
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For convenience, we denote by (r(a, b)) a be ^ the tree metric corresponding to the 
phylogeny T = (V, E, [n], p; r). We extend t(u, v) to all vertices u,v G V in the 
obvious way. 

Example 1 (Homogeneous Tree) For an integer h > 0, we denote by = 
(y( h \ E^ h \ L>( h \ p( h ^; t) a rooted phylogeny where is the h-level complete 
binary tree with arbitrary edge weight function r and L^ 1 ) = [2 h ]. ForO < h! < h, 
we let L^) be the vertices on level h — h! (from the root). In particular, = L^ 

and = {pW}. We let MY = {HY„} n >i be the set of all phylogenies with 
homogeneous underlying trees. 

Model of molecular sequence evolution. Phylogenies are reconstructed from 
molecular sequences extracted from the observed species. The standard model of 
evolution for such sequences is a Markov model on a tree (MMT). 

Definition 3 (Markov Model on a Tree) Let & be a finite set of character states 
with (p = |$|. Typically $ = {+1, -1} or $ = {A, G, C, T}. Let n > 1 and let 
T = (V,E, [n] , p) be a rooted tree with leaves labeled in [n]. For each edge e G E, 
we are given a if x p stochastic matrix M e = (Mfj)ij^, with fixed stationary 
distribution it = (7Tj)j e $. An MMT ({M e } e£ £;,T) associates a state a v in <E> to 
each vertex v in V as follows: pick a state for the root p according to tt; moving 
away from the root, choose a state for each vertex v independently according to the 
distribution {M% u j)j£$>, with e = (u, v) where u is the parent of v. 

The most common MMT used in phylogenetics is the so-called general time- 
reversible (GTR) model. 

Definition 4 (GTR Model) Let <E> be a set of character states with p = |$| and 
it be a distribution on $ satisfying 7Tj > for all i G $. For n > 1, let T = 
(V, E, [n], p; t) be a phylogeny. Let Q be a ip x p rate matrix, that is, Qij > Qfor 
all i / j and 

for all i G <!>. Assume Q is reversible with respect to ir, that is, 

for all i,j G <£. The GTR model on T with rate matrix Q is an MMT on T = 
(V, E, [n], p) with transition matrices M e = e Te ® , for all e G E. By the reversibil- 
ity assumption, Q has p> real eigenvalues 

= Ai > A 2 > • • • > hp. 
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We normalize Q by fixing A2 = —1. We denote by Q^, the set of all such rate 
matrices. We let G nt p = Y n ® the set of all upstate GTR models on n 

leaves. We denote G^ = {G n)ip } n>l . We denote by £yv the vector of states on the 
vertices W C V. In particular, £[ n ] are the states at the leaves. We denote by £t,Q 
the distribution <?/£[„]. 

GTR models include as special cases many popular models such as the CFN model. 

Example 2 (CFN Model) The CFN model is the GTR model with ip = 2, ir = 
(1/2,1/2), and 

-1/2 1/2 
1/2 -1/2 



q = g CFN = 



Example 3 (Binary Asymmetric Channel) More generally, letting $ = {+, -} 
and ir = (ir + , 7r_ ), with ir + , 7r_ > 0, we can take 



Q 



-7T_ 7T_ 
7T+ -7T+ 



Phylogenetic reconstruction. A standard assumption in molecular evolution is 
that each site in a sequence (DNA, protein, etc.) evolves independently accord- 
ing to a Markov model on a tree, such as the GTR model above. Because of the 
reversibility assumption, the root of the phylogeny cannot be identified and we 
reconstruct phylogenies up to their root. 

Definition 5 (Phylogenetic Reconstruction Problem) Let Y = {Y ra } n >i be a 

subset of phylogenies and be a subset of rate matrices on ip states. Let T = 
(V,E, [n],p;7~) G Y. IfT = (V,E, [n], p) is the rooted tree underlying T, we 
denote by T_ [T] the tree T where the root is removed: that is, we replace the 
two edges adjacent to the root by a single edge. We denote by T n the set of all 
leaf-labeled trees on n leaves with internal degrees 3 and we let T = {T n } n >i. 
A phylogenetic reconstruction algorithm is a collection of maps A = {A n ,k}n,k>i 
from sequences (C[ n ])i=i £ ) fc to leaf-labeled trees T G T n . We only consider 
algorithms A computable in time polynomial in n and k. Let k(n) be an increas- 
ing function of n. We say that A solves the phylogenetic reconstruction problem 
mY® Q v with sequence length k = k(n) if for all 5 > 0, there is no > 1 such 
that for all n> n$, T G Y n , Q G Q^, 

^[AnMn) (tfU)±3)=T-[T\]>l-8, 

where {G n ])i=i are i-i-d- samples from C-t,q- 
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An important result of this kind was given by Erdos et al. [ESSW99a]. 

Theorem 1 (Polynomial Reconstruction [ESSW99a]) Let < / < g < +00 

and denote by Y$> 9 the set of all phytogenies T = (V, E, [n],p; r) satisfying f < 
T e < 9, Ve G E. Then, for all (p > 2 and all < / < g < +00, the phylogenetic 
reconstruction problem on Y^ 9 <S> can be solved with k = poly(n). 

This result was recently improved by Daskalakis et al. [DMR06, DMR11] (see 
also [MHR09]) in the so-called Kesten-Stigum reconstruction phase, that is, when 
g < In y/2. 

Definition 6 (A-Branch Model) Let < A < / < g < +00 and denote by Y^f 
the set of all phytogenies T = (V, E, [n] , p; r) satisfying f < r e < g where r e is an 
integer multiple of A, for all e G E. For ip > 2 and Q G Q^, we call Y^ 9 ® {Q} 
the A-Branch Model (A-BM). 

Let g* = In y/2. 

Theorem 2 (Logarithmic Reconstruction [DMR06, DMR11, MHR09]) For 

0<A<f<g<g*, the phylogenetic reconstruction problem on Y^ 9 ®{Q CFN } 
can be solved with k = O(logn) 2 . 

Distance methods. The proof of Theorem 1 uses distance methods, which we 
now define formally. 

Definition 7 (Correlation Matrix) Let $ be a finite set with tp > 2. Let 

be the sequences at a,b G [n\ For v\,V2 G we define the correlation matrix 
between a and b by 

k 

i=i 

andF ab = (FS b lV2 ) Vl , V2 ^- 

Definition 8 (Distance Method) A phylogenetic reconstruction algorithm A = 
jt>i is said to be distance-based if A depends on the data (C[ n ])i=i G 
(<j>M) fc nly through the correlation matrices {F ab } a ^e[n]- 

2 The correct statement of this result appears in [DMR1 1]. Because of different conventions, our 
edge weights are scaled by a factor of 2 compared to those in [DMR1 1]. The dependence of k in A 
is A" 2 . 
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The previous definition takes a very general view of distance -based methods: any 
method that uses only pairwise sequence comparisons. In practice, most distance- 
based approaches actually use a specific distance estimator, that is, a function of 
F ab that converges to r(a, b) in probability as n — > +00. We give two classical 
examples below. 

Example 4 (CFN Metric) In the CFN case with state space <E> = {+,—}, a stan- 
dard distance estimator (up to a constant) is 

P(F) = -ln(l-2(F + _+F_ + )). 

Example 5 (Log-Det Distance [BH87, Lak94, LSHP94, Ste94]) More generally, 
a common distance estimator (up to scaling) is the so-called log-det distance 

V{F) = -ln|detF|. 

Loosely speaking, the log-det distance can be thought as a generalization of the 
CFN metric. We will use a different generalization of the CFN metric. See sec- 
tion 1.3. 

1.2 Results 

In our main result, we prove that phylogenies under GTR models of mutation can 
be inferred using a distance -based method from k = O(logn) sequence length. 

Theorem 3 (Main Result) For all ip > 2, < A < / < g < g* and Q G Q v , 

there is a distance-based method solving the phylogenetic reconstruction problem 
on Y f ^ 9 ® {Q} with k = O(logn). 3 

Note that this result is a substantial improvement over Theorem 1 — at least, in a 
certain range of parameters — and that it matches the bound obtained in Theorem 2. 
The result is also novel in two ways over Theorem 2: only the distance matrix is 
used; the result applies to a larger class of mutation matrices. A weaker version 
of the result stated here was first reported without proof in [Roc08]. Note that 
in [Roc08] the result was stated without the discretization assumption which is in 
fact needed for the final step of the proof. This is further explained in Section 7.3 
of [DMR1 1]. The new proofs presented here rely on recent joint work with Yuval 
Peres [PR11] on exponential moment bounds for quantities such as a a . 

In an attempt to keep the paper as self-contained as possible we first give a 
proof in the special case of homogeneous trees. This allows to keep the algorith- 
mic details to a minimum. The proof appears in Section 3. We extend the result 
to general trees in Section 4. The more general result relies on a combinatorial 
algorithm of [DMR11]. 

" As in Theorem 2, the dependence of k in A is A -2 [Roc 10]. 
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1.3 Proof Overview 



Distance averaging. The basic insight behind Steel's conjecture is that the accu- 
rate reconstruction of ancestral sequences in the reconstruction phase can be har- 
nessed to perform a better reconstruction of the phylogeny itself. For now, consider 
the CFN model with character space {+1,-1} and assume that our phylogeny is 
homogeneous with uniform branch lengths u. Generate k i.i.d. samples (uy)f =1 . 
Let a, b be two internal vertices on level h — h' < h (from the root). Suppose we 
seek to estimate the distance between a and b. This estimation cannot be performed 
directly because the sequences at a and b are not known. However, we can try to 
estimate these internal sequences. Denote by A, B the leaf set below a and b re- 
spectively. An estimate of the sequence at a is the (properly normalized) "site-wise 
average" of the sequences at A 

1 1 a'dA 

for i = l,...,k, and similarly for b. It is not immediately clear how such a 
site-wise procedure involving simultaneously a large number of leaves can be per- 
formed using the more aggregated information in the correlation matrices {F uv } u ,ve[n] 
Nevertheless, note that the quantity we are ultimately interested in computing is the 
following estimate of the CFN metric between a and b 

f (a, b) = — In 

Our results are based on the following observation: 

k 

f(a, b) = — In 



= -In 





where note that the last line depends only on distance estimates f(a', b') between 
leaves a', U in A, B. In other words, through this procedure, which we call expo- 
nential averaging, we perform an implicit ancestral sequence reconstruction using 
only distance estimates. One can also think of this as a variance reduction tech- 
nique. When the branch lengths are not uniform, one needs to use a weighted 
version of (2). This requires the estimation of path lengths. 
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GTR models. In the case of GTR models, the standard log-det estimator does not 
lend itself well to the exponential averaging procedure described above. Instead, 
we use an estimator involving the right eigenvector v corresponding to the second 
eigenvalue A2 of Q. For a, b € [n], we consider the estimator 



This choice is justified by a generalization of (2) introduced in [MP03]. Note that 
v may need to be estimated. 

Concentration. There is a further complication in that to obtain results with 
high probability, one needs to show that f(a, b) is highly concentrated. However, 
one cannot directly apply standard concentration inequalities because a a is not 
bounded. Classical results on the reconstruction problem imply that the variance 
of a a is finite — which is not quite enough. To show concentration, we bound the 
moment generating function of a a . 

1.4 Organization 

In Section 2, we provide a detailed account of the connection between ancestral 
sequence reconstruction and distance averaging. We then give a proof of our main 
result in the case of homogeneous trees in Section 3. In Section 4, we conclude 
with a sketch of the proof in the general case. 

In the Appendix, we provide a few complimentary results. In Section A, we 
show that the distance matrix is not in general a sufficient statistic. In Section B, 
we analyze a standard algorithm, known as WPGMA, in the so-called molecular 
clock case, that is, when the mutation rate is the same on all branches of the tree. In 
particular, in the latter case we note that the discretized branch length assumption 
is not needed. 

2 Ancestral Reconstruction and Distance Averaging 

Let (p > 2, 0<A<f<g<g*=ln y/2, and Q G Q v with corresponding sta- 
tionary distribution it > 0. In this section we restrict ourselves to the homogeneous 
case T = = (V, E, [n],p; r) where we take h = log 2 n and / < r e < g and 
r e is an integer multiple of A, Ve € E. (See Examples 1 and 2 and Theorem 2.) 4 

4 Note that, without loss of generality, we can consider performing ancestral state reconstruction 
on a homogeneous tree as it is always possible to "complete" a general tree with zero-length edges. 
We come back to this point in Section 4. 




(3) 
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Throughout this section, we use a sequence length k > k log(n) where n is 
a constant to be determined later. We generate k i.i.d. samples (£y)f =1 from the 
GTR model (T, Q) with state space <3>. 

2.1 Distance Estimator 

The standard log-det estimator does not lend itself well to the averaging procedure 
discussed above. For reconstruction purposes, we instead use an estimator involv- 
ing the right eigenvector v corresponding to the second eigenvalue A2 of Q. For 
a, b e [n], consider the estimator 

f (a, b) = - In (VF af v) , (4) 

where the correlation matrix F ab was introduced in Definition 7. We first give a 
proof that this is indeed a legitimate distance estimator. For more on connections 
between eigenvalues of the rate matrix and distance estimation, see e.g. [GL96, 
GL98, GMY09]. 

Lemma 1 (Distance Estimator) Let f be as above. For all a, b G [n\, we have 

E [ e -r(a,6)] = e -T(o,6)_ 

Proof: Note that E[F^] = m (e~ T ^ Q ) Then 

_ e -r(a,6)_ 

For a G [n\ and z = 1, . . . , k, let 
Then (4) is equivalent to 

t{a,b) = -]n(^J^aialj. (5) 

Note that in the CFN case, we have simply v = (1, — 1) T and hence (5) can be 
interpreted as a generalization of the CFN metric. 



E 
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2.2 Ancestral Sequence Reconstruction 

Let e = (x, y) € E and assume that x is closest to p (in topological distance). We 
define Path(/9, e) = Path(p, y), |e| p = |Path(f,e)|, and 

R p (e) = (1 - # e 2 ) 0-;, 

where Q p , y = e' T ^ and 6 e = e~< e \ 

Proposition 1 below is a variant of Lemma 5.3 in [MP03]. For completeness, 
we give a proof. 

Proposition 1 (Weighted Majority: GTR Version) Let £r n i be a sample from Lt,Q 
(see Definition 4) with corresponding ay n y For a unitflow ^ from pto [n], consider 
the estimator 

s = * ( ' r)rT ' 



Then, we have 

and 
where 



®p,x 



E[S] = 0, 
nS\£ p ] =a p , 

Var[S] = l + K^, 

eeE 



Proof: We follow the proofs of [EKPS00, MP03]. Let e, L be the unit vector in 
direction i. Let x € [n], then 

E[eJjCp]=eJe T( ^ )Q - 

Therefore, 

E[<7* I dp] = eJ/^ x)Q y = a p e-^ x \ 

and 

nS | f„] = £ = a p ^ ¥(*) = V 

xe[n] i6[n] 

In particular, 

E[S] = ^2ir l v l = 0. 
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For x, y G [n] , let x A y be the meeting point of the paths between p,x,y. We 
have 

E[a x a y ] = ^2^[£xAy = l]^[<?x<Jy I CxAy = l] 

= ^ KiMWx I £xAy = t]E[(7j, | £ xA j, = t] 

= E'. 



e -T{xAy,x) v e -T(xAy,y) u 



= e 



Then 



Var[5] = E[S 2 } 



x,y£[n] 
x,yd [n] 

For e € £7, let e = (&f, ej.) where e-f- is the vertex closest to p. Then, by a telescop- 
ing sum, for u S y 



3 2r(p,e t ) 

eGPath(p,«) eePath(p,«) eePath(p,w) 

= e 2r( - p ' u) - 1, 



and therefore 



E[5 2 ] = y(x)y(y)e 2Tiv ' xAy) 

x,ye[n] 

x,j/G[w] \ eePath(p,zAj/) 

= l + ^i? p (e) Y l{e€Path(^,xAj/)}*(x)*(i/) 
= l + ]>> p (e)*(e) 2 . 
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Let ^ be a unit flow from p to [n]. We will use the following multiplicative 
decomposition of ^: If ty(x) > 0, we let 

and, if instead \I/(x) = 0, we let Y>(y) = 0. Denoting x^ the immediate ancestor of 
x £ V and letting 8 X = e T(x i-' x) , it will be useful to re- write 

/j'=0 a , gi (h) eePath(p,x) e 

and to define the following recursion from the leaves. For x G [n], 

= 0. 

Then, let it £ V - [n] with children ui,^ with corresponding edges ei,e2 and 
define 



a=l,2 V e « 



Note that, from (6), we have K p ^ = Kq,. 

Because of our use of short sequences, bounds on the variance are not enough 
for our purposes: We need exponential concentration on our distance estimates. To 
obtain such concentration, we give bounds on the exponential moment of S. Our 
proof generalizes a recent argument of Peres and Roch [PR11]. 

Proposition 2 (Weighted Majority: Exponential Bound) For (el, let 
r(C) = mE[exp(CS)|£ p = ^]. 

Then, there exists c > depending only on Q and f such that for all ( G 1, we 
have 

Proof: We prove the claim by induction, moving away from the leaves. We begin 
with an analytical lemma inspired by the proof of [PR11]. 
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Lemma 2 (Recursion Step) Let M = e T ® with second right eigenvector v and 
corresponding eigenvalue A = eT T satisfying r > /. Then there is c > depend- 
ing on Q and f such that for all i £ Q 

F(x) = ^2 M ij exp(ujx) < exp(Az^x + ^c(l - A 2 )x 2 ) = G(x), (7) 
for all i£i 

Proof: Let d = c(l - A 2 ). Note that 

F'{x) = ^2 MijVj exp(z/jx), 

F"(x) = ^Miji/j exp(ujx), 
G'(x) = (Xui + c'x) exp(A^x H — c'x 2 ), 



and 
Hence, 

Let 

and 

Note that 
and 



G"(x) = ((Xui + c'x) 2 + d) exp(AfjX + -jdx 2 ). 



F(0) = G(0) = 1, 
F'{0) = G'(0) = Afj. 

7f = min7r t , 

1 

= max \ uA < 



IT 



F"{x) < P 2 exp(p|x|) = F(x), 



G"(x) > c'exp(-u\x\ + -c'x 2 ) = G(x). 

Choose d = c* > such that F(x) < G(x) for all i£R. Note in particular that 
taking 

c* > max {4i>, v 2 exp(2z>)} , 
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is enough. Indeed, for \x\ > 1 we have c* > v 2 and exp(— v\x\ + \c*x 2 ) > 
exp(p|x|) so that F{x) < G(x). For \x\ < 1, we have 

G(x) > c* exp(-zv) > v 2 exp(^) > F(x). 

Now choose c = c*(l — e~ 2 ^) -1 in (7) (which implies c' > c* by r > /). 
Then, 

G"{x) > G{x) > F(x) > F"{x), 

and therefore 

G(x) > F(x), 

for all x € R. ■ 

Going back to the proof of Proposition 2, let S x = a x for all x G [n] and 

o _ o V'(ea) 

D « — /) > 

a=l,2 e " 

where w G V — [n] with children t>i, t>2 with corresponding edges ei, e2- Note that 
S P = S. Let 

ri(C) = lnE[exp(CS u )|£ u = i]. 

Take c > as in Lemma 2. The main claim is clearly true at the leaves, that is, 
for all x G [n] 

r x (C) = InE[exp(C5x)|6 = i] 
= lnE[exp((> x ) | £ x = i] 

1 2 
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For u G V — [n] as above, we have by the Markov property, induction, and Lemma 2 



exp 



K(0 = InE 



c E s «. 



Q = l,2 



a=l,2 



exp C-^u 



V>(e c 



E ln E^ QE 

a=l,2 



exp (S Vi 



\€u = i 



I &>„ = 3 



= E ln (E M ^ ex p( r i(c 

a=l,2 \je* V V 

^ E ln IE ^ -P ^ (C^) + \cK Vat9 
a =i,2 ye$ V ^ 6a ^ 

- k E (I 5 " 

^ a=l,2 V Uea 



'e a 



+ EHE^W^) 



a: 

+ E»«.".(<^) + ^-<>(< 

„ — 1 o \ e a / V 



a=l,2 



= M+UC 2 ECU-O+^a,*) 



a=l,2 



^(e Q ) 



6a 



2.3 Distance Averaging 

The input to our tree reconstruction algorithm is the matrix of all estimated dis- 
tances between pairs of leaves {r(a, &)} a ,&,e[nl- For short sequences, these esti- 
mated distances are known to be accurate for leaves that are close enough. We now 
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show how to compute distances between internal nodes in a way that involves only 
{f (a, &)} a ,6,6[n] ( an d previously computed internal weights) using Proposition 2. 
Let < ti < h. For v G L^ } , let T v = (V v , E v ) be the subtree of T = 

rooted at v with leaf set denoted L v . Let a, b 6 £7?. For x G {a, 6}, denote by X 
the leaves of T = T^> below x. Assume that we are given 6 e , for all e below a, b. 
We estimate r(a, 6) as follows 



r(a,b) = —In 



\A\\B\ ^ ^ a ' a ' b ' b 

V 11 1 a'&Ab'&B / 



This choice of estimator is suggested by the following observation 

a'eAb'eB 

Note that the first line depends only on estimates (f {u, v)) u ^ v& \ n ] and {@ V: .} v( zv a uv b - 
The last line is the empirical distance between the reconstructed states at a and b 
when the flow is chosen to be homogeneous in Proposition 1 . 

Lemma 3 (Large Deviations) Let < b! < h and let a, b G For x = a,b, 
let 

2-h'q x/ 

bx - 1^ e , ■ 

x'ex x ' x 

It holds that 

E [ e -f(a,6)] = e -r(a,6) ; 

a«<i ?/iere ex/sfs (* > ima// enough such that 

E[exp((S a S b )} < +00, 
for all I C| < |C*|- ^ w particular, for all e > ?/iere ex/sfs < % < 1 swc/i f/iaf 



e -T(a,6) _ E[ e -f (o,6)] 
Moreover, \ is a constant independent of h! . 



> e 



<x k - 
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Proof: We first prove the expectation formula. Note that 



E [ e -r(a,6)j = E 



2~ h at, 



= E 



= E 



i=l \a'eA 

2- h 'a a 

©a,a 



E^ E 



\a'eA 



\b'eB 



2- h 'ay 



&hh' 



E 



E 



E 



E 



we A 



E 



Z ay 



'eB 



E 



o-ti „ 



La's A 
= E [a a a h \ 

p --r(a,b) 



6 



a, a' 



E 



E 

.b'eB 



^ ay 

©6,6' 
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where we used that \ A\ = \B\ = r L h ' . 

To prove the large deviation result, it suffices by standard arguments [Dur96] 
to bound the exponential moment of 



S a Sb 



E 

\a'eA 



©o,a' 



E 

Kb'eB 



&b,b' 



Let N be Normal(0, 1) and recall that E[e CAr ] = e^l 2 . By applying Proposition 2 
twice and using Fubini's Theorem for positive random variables (see also [PR1 1]), 
we get (letting <I> be the homogeneous flow on T) 

E[exp((S a S b ) | &] < E[exp(a a C5 b + *cC 2 S 6 2 K a ,*) | £ a , &] 



= E[exp(a a C5 f) + ^K~^Q S h N) \ £ a , &] 
= E[exp(5 b (a a C + v ^K^ (N)) | &] 
< E[exp(a 6 ( C 7 a C + y/cKa^N) 



< +00, 



uniformly in cr a , <r& for |C| > small enough, where we used |<j |, \ab\ < v < +00, 
Cauchy-Schwarz, and 



E ^ e c 2 C 2 K a ^K b ^N 2 



1 



1 - 2(c^C 2 K a ^K b ^ 



1/2 



< +00, 
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for small enough (. Above we used the moment-generating function of the chi- 
square distribution. 5 

To prove that the large deviation result is independent of the level h', we show 
that K a ^ is uniformly bounded in h! . From (6), we have 

h'-i p 2(h'-i)g 

K a ,* < E^-^'-'-^w 

h' 

< ^ e 2jg e -(21n v / 2)j 

i=l 
h' 

= ^2^(9-9") 

3=1 

+oo 

< ^( e -2(9--fl) ) j 
j=0 

= l - e-W-g) < +°°' (8) 

where recall that 5* = In y/2 and 5 < g*. ■ 

In the next section, we use the previous lemma in two situations: 1) to estimate 
the distance between two close vertices on the same level; 2) to detect that two 
vertices on the same level are "far apart." These specializations of Lemma 3 are 
stated below. We only sketch the proofs, which are straightforward. 

Proposition 3 (Deep Distance Computation: Small Diameter) Let D > 0, 7 > 

0, and e > 0. Let a, b G L^) as above. There exist k > such that if the following 
conditions hold: 

• [Small Diameter] r(a, b) < D, 

• [Sequence Length] k > k log(n), 
then 

|f (a, b) — r(a, 6)| < e, 
with probability at least 1 — 0(n~ 7 ). 



5 A more careful analysis gives the dependence of % in A as \ — 1 — 0(A 2 ) [RoclO]. 
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Proof: Let 

e' = min{(e £ - l) e - D , (1 - e~ £ )e~ D }, 

and observe that 

f (a, b) — r(a, b) < —e 

=► e"^^ - e - r(a ' b) > (e £ - l)e~ D > e'. 

A similar implication holds in the other direction. The result now follows from 
Lemma 3. ■ 

Proposition 4 (Deep Distance Computation: Diameter Test) Let D > 0, W > 

5, and 7 > 0. Let a, b G as above. There exists k > swc/i ?/za? /f ?/ze 
following conditions hold: 

• [Large Diameter] r(a,b) > D + In W, 

• [Sequence Length] k > k log(n), 
then 

W 

f(o,6) > D + lny, 

w/fft probability at least 1 — n~ 7 . 0« o?/ier /ia«<i, if the first condition above is 
replaced by 

• [Small Diameter] r(a, 5) < D + In ^, 
then 

W 

f(a,b) <D + ln—, 
with probability at least 1 — n~ 7 . 

Proof: The proof is similar to the proof of Proposition 3. ■ 

3 Reconstructing Homogeneous Trees 

In this section, we prove our main result in the case of homogeneous trees. More 
precisely, we prove the following. 
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Theorem 4 (Main Result: Homogeneous Case) Let 0<A<f<g< +00 
and denote by MY^ g the set of all homogeneous phytogenies T = (V, E, [n], p; r) 
satisfying f < r e < g and r e is an integer multiple of A, Ve G £7. Lef g* = In \/2. 
77iera, /or all if > 2, < A < f < g < g* and Q G Q^, ?/iere is a distance- 
based method solving the phylogenetic reconstruction problem on W¥^ 9 (g> {Q} 
with k = O(logn). 

In the homogeneous case, we can build the tree level by level using simple 
"four-point" techniques [Bun71]. See e.g. [SS03, Fel04] for background and de- 
tails. See also Section 3.2 below. The underlying combinatorial algorithm we use 
here is essentially identical to the one used by Mossel in [Mos04]. From Proposi- 
tions 3 and 4, we get that the "local metric" on each level is accurate as long as we 
compute adequate weights. We summarize this fact in the next proposition. For 
A > and z G R+, we let [z]a be the closest multiple of A to z (breaking ties 
arbitrarily). For D > 0, W > 5, we define 

r w' 

SD(o, b) = 1 I [f(a, b)] A <D + ln — 

and we let 

[f(a,b)] A , ifSB(a,6) = l, 



d(a,6) = 



+00, o.w. 



Proposition 5 (Deep Distorted Metric) Let D > 0, W > 5, and 7 > 0. Let 

T = (V,E,[n],p;T) G MY^ 9 with g < g*. Let a,b G L^ ] for < ti < h. 
Assume we are given, for x = a,b, 6 e for all e G V x . There exists k > 0, such that 
if the following condition holds: 

• [Sequence Length] The sequence length is k > k log(n), 

then we have, with probability at least 1 — 0(n~ 7 ), 

d(a, b) = r(a, b) 

under either of the following two conditions: 

1. [Small Diameter] r(a, b) < D, or 

2. [Finite Estimate] d(ct, b) < +00. 

Proof: We let e < A/2. The first part of the proposition follows immediately from 
Proposition 3 and the second part of Proposition 4. For the second part, choose k 
so as to satisfy the conditions of Proposition 3 with diameter D + In W and apply 
the first part of Proposition 4. ■ 

It remains to show how to compute the weights, which is the purpose of the next 
section. 



23 



3.1 Estimating Averaging Weights 

Proposition 5 relies on the prior computation of the weights 8 e for all e £ V x , for 
x = a, b. In this section, we show how this estimation is performed. 

Let a,b,c G Denote by z the meeting point of the paths joining a, b, c. 

We define the "three-point" estimate 



Note that the expression in parenthesis is an estimate of the distance between a and 
z. 

Proposition 6 (Averaging Weight Estimation) Let a,b,c G L$ as above. As- 
sume that the assumptions of Propositions 3, 4, 5 hold. Assume further that the 
following condition hold: 

• [Small Diameter] r (a, 6), r (a, c), r (b, c) <D + \aW, 



Proof: The proof follows immediately from Proposition 5 and the remark above 
the statement of Proposition 6. ■ 

3.2 Putting it All Together 

Let < ti < h and Q = {a,b,c,d} C L^) . The topology of restricted 
to Q is completely characterized by a bi-partition or quartet split q of the form: 
ab\cd, ac\bd or ad\bc. The most basic operation in quartet-based reconstruction 
algorithms is the inference of such quartet splits. In distance-based methods in 
particular, this is usually done by performing the so-called four-point test: letting 



Of course, we cannot compute T(a, b\c, d) directly unless h! = 0. Instead we use 
Proposition 5. 




then 



with probability at least 1 — 0(n~ 7 ) where 8 z<a = 0(a; b, c). 



F(ab\cd) = -[r(a, c) + r(6, d) — r(a, b) — r(c, d)], 



we have 



ab\cd if F(a, b\c, d) > 
q = < ac\bd if F(a, b\c, d) < 
ad|fcc o.w. 
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Deep Four-Point Test. Assume we have previously computed weights 9 e for all 

e £ V x , for x = a, b, c, d. We let 

F(ab\cd) = -[d(a,c) + d(M) -d(a,6) - d(c,d)], (9) 

and we define the deep four-point test 

¥¥(a,b\c,d) = l{F(ab\cd) > f/2}, 

with FP(a, 6|c, d) = if any of the distances in (9) is infinite. Also, we extend 
the diameter test SD to arbitrary subsets by letting §D(<S) = 1 if and only if 
SB(x, y) = 1 for all pairs x,y G S. 

Algorithm. Fix D > 4g, W > 5, 7 > 3. Choose k so as to satisfy Propositions 5 
and 6. Let J?o be the set of leaves. The algorithm — a standard cherry picking 
algorithm — is detailed in Figure 1. 

Proof of Theorem 4: The proof of Theorem 4 follows from Propositions 5 and 6. 
Indeed, at each level h' , we are guaranteed by the above to compute a distorted 
metric with a radius large enough to detect all cherries on the next level using 
four-point tests. The proof follows by induction. ■ 

4 Extension to General Trees 

It is possible to generalize the previous arguments to general trees, using a com- 
binatorial algorithm of [DMR11], thereby giving a proof of Theorem 3. To apply 
the algorithm of [DMR11] we need to obtain a generalization of Proposition 5 for 
disjoint subtrees in "general position." This is somewhat straightforward and we 
give a quick sketch in this section. 

4.1 Basic Definitions 

The algorithm in [DMR11] is called Blindfolded Cherry Picking. We refer the 
reader to [DMR11] for a full description of the algorithm, which is somewhat in- 
volved. It is very similar in spirit to the algorithm introduced in Section 3.2, except 
for complications due to the non-homogeneity of the tree. The proof in [DMR11] 
is modular and relies on two main components: a distance -based combinatorial ar- 
gument which remains unchanged in our setting; and a statistical argument which 
we now adapt. The key to the latter is [DMR11, Proposition 4]. Note that [DMR1 1, 
Proposition 4] is not distance-based as it relies on a complex ancestral reconstruc- 
tion function — recursive majority. Our main contribution in this section is to show 
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Algorithm 

Input: Distance estimates {f(a, &)} ,6e[n]; 
Output: Tree; 

• For/i' = l,...,h- 1, 

1. Four-Point Test. Let 

fth' = {<? = ab\cd : Vo, b,c,d € Zy distinct such that FP(g) = 1}. 

2. Cherry Picking. Identify the cherries in 1Zh>, that is, those pairs of ver- 
tices that only appear on the same side of the quartet splits in H h i . Let 

v _ r (h'+l) (h' + l) x 

^h' + l — \di i • ■ • i a 2 h-(h' + l) ii 

be the parents of the cherries in Z^i 

3. Weight Estimation. For all z e Zh'+i, 

(a) Let x, y be the children of z. Choose w to be any other vertex in Zy 
with§D({x,y,w;}) = 1. 

(b) Compute 

6 z ,x = Q(x;y,w). 

(c) Repeat the previous step interchanging the role of x and y. 

Figure 1: Algorithm. 

how this result can be obtained using the techniques of the previous sections — 
leading to a fully distance-based reconstruction algorithm. 

In order to explain the complications due to the non-homogeneity of the tree 
and state our main result, we first need to borrow a few definitions from [DMR1 1]. 

Basic Definitions. Fix < A < / < g < g* as in Theorem 3. Let T = 
(V, E, [n],p; t) G Y^ g be a phylogeny with underlying tree T = (V, E). In this 
section, we sometimes refer to the edge set, vertex set and leaf set of a tree T' as 
£{T'),V{T'), and C{T) respectively. 

Definition 9 (Restricted Subtree) Let V' C V be a subset of the vertices of T. 
The subtree of T restricted to V' is the tree T' obtained by 1) keeping only nodes 
and edges on paths between vertices in V' and 2) by then contracting all paths 
composed of vertices of degree 2, except the nodes in V'. We sometimes use the 
notation T' =T\y. See Figure 2 for an example. 
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Figure 3: The subtrees T| {MliU2iM3iUg} and T\ {ui ^ U6jUr} are edge-disjoint. The 
subtrees T\ {uuU5>U6:Us} and T\ {u2>U3:UijU7} are edge-sharing. 



Definition 10 (Edge Disjointness) Denote by Patbr(x, y) the path (sequence of 
edges) connecting x to y in T. We say that two restricted subtrees T±, T 2 ofT are 
edge disjoint if 

Path T (xi,yi) n Path T (x 2 ,y 2 ) = 0, 

for all xi,yi € C{T\) and X2, yi G £(^2). We say that T±, T 2 are edge sharing 
they are not edge disjoint. See Figure 3 for an example. 

Definition 11 (Legal Subforest) We say that a tree is a rooted full binary tree if 
all its internal nodes have degree 3 except the root which has degree 2. A restricted 
subtree T\ ofT is a legal subtree ofT if it is also a rooted full binary tree. We say 
that a forest 

J 7 = {Ti,T 2 , . . .}, 

is legal subforest ofT if the T L 's are edge-disjoint legal subtrees ofT. We denote 
by p(F) the set of roots ofT. 

Definition 12 (Dangling Subtrees) We say that two edge-disjoint legal subtrees 
T\, T2 ofT are dangling if there is a choice of root for T not in T\ or T 2 that is 
consistent with the rooting of both T\ and T 2 . See Figure 4 below for an example 
where two legal, edge-disjoint subtrees are not dangling. 

Definition 13 (Basic Disjoint Setup (General)) Let T\ = T X1 and T 2 = T X2 be 

two restricted subtrees ofT rooted at x\ and x 2 respectively. Assume further that 
T\ and T 2 are edge-disjoint, but not necessarily dangling. Denote by y^z L the 
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Figure 4: Basic Disjoint Setup (General). The rooted subtrees T\,Ti are edge- 
disjoint but are not assumed to be dangling. The white nodes may not be in the 
restricted subtrees Ti, T^. The case w\ = X\ and/or W2 = X2 is possible. Note that 
if we root the tree at any node along the dashed path, the subtrees rooted at y\ and 
y2 are edge-disjoint and dangling (unlike T\ and Ti). 



children of x t in T b , i = 1, 2. Let w L be the node in T where the path between T\ 
and T2 meets T t , 1 = 1, 2. Note that w L may not be in T L since T L is restricted, 
l = 1, 2. If w L 7^ x b , assume without loss of generality that w L is in the subtree 
of T rooted at z L , 1 = 1,2. We call this configuration the Basic Disjoint Setup 
(General). See Figure 4. Let t(T\, T2) be the length of the path between w\ and 
W2 in the metric r. 



4.2 Deep Distorted Metric 

Our reconstruction algorithm for homogeneous trees (see Section 3) builds the tree 
level by level and only encounters situations where one has to compute the distance 
between two dangling subtrees (that is, the path connecting the subtrees "goes 
above them"). However, when reconstructing general trees by growing a subforest 
from the leaves, more general situations such as the one depicted in Figure 4 cannot 
be avoided and have to be dealt with carefully. 

Hence, our goal in this subsection is to compute the distance between the inter- 
nal nodes x\ and %2 in the Basic Disjoint Setup (General). We have already shown 
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how to perform this computation when T\ and T2 are dangling, as this case is han- 
dled easily by Proposition 5 (after a slight modification of the distance estimate; see 
below). However, in the general case depicted in Figure 4, there is a complication. 
When T\ and T2 are not dangling, the reconstructed sequences at x\ and X2 are not 
conditionally independent. But it can be shown that for the algorithm Blindfolded 
Cherry Picking to work properly, we need: 1) to compute the distance between 
x\ and X2 correctly when the two subtrees are close and dangling; 2) detect when 
the two subtrees are far apart (but an accurate distance estimate is not required in 
that case). This turns out to be enough because the algorithm Blindfolded Cherry 
Picking ensures roughly that close reconstructed subtrees are always dangling. We 
refer the reader to [DMR1 1] for details. 

The key point is the following: if one computes the distance between y\ and 1/2 
rather than the distance between x\ and X2, then the dangling assumption is satis- 
fied (re-root the tree at any node along the path connecting w\ and W2). However, 
when the algorithm has only reconstructed T\ and T2, we cannot tell which pair 
in {yi,Z\} x {j/2, ^2} is the right one to use for the distance estimation. Instead, 
we compute the distance for all pairs in {yi, z\} x {j/2, ^2} and the following then 
holds: in the dangling case, all these distances will agree (after subtracting the 
length of the edges between x\,X2 and {y±, z±, j/2, ^2}); m the general case, at 
least one is correct. This is the basic observation behind the routine DlSTORTED- 
Metric in Figure 5 and the proof of Proposition 7 below. We slightly modify the 
definitions of Section 3. 

Using the notation of Definition 13, fix (a, b) € {y±, z{\ x {y 2 , ^2}- For x = 
a, b, denote by X the leaves of T x and let \£\ x be the graph distance (that is, the 
number of edges) between x and leaf £ G X. Assume that we are given 9 e for all 
e G 8(T a ) U £(T b ). We estimate r(a, b) as follows 



and we can think of the weights on A (similarly for B) as resulting from a homoge- 
neous flow from a to A. Then, the bounds on the variance and the exponential 
moment of 




Note that, because the tree is binary, it holds that 




s a = £ 2-n°e a -> a ,, 



a'eA 
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in Propositions 1 and 2 still hold with 

K a ,* a = R a( e Mef- 

ee£(T a ) 

Moreover K a ^ a is uniformly bounded following an argument identical to (8) in the 
proof of Lemma 3. In particular, the same large deviations result hold for f (a, 6). 
For D > 0, W > 5, we define 

SD(a,6) = l|[f(a,6)] A <D + lny|, 

and we let 

[f(a,b)] A , ifSB(a,6) = l, 



d(o,6) 



+oo, o.w. 



Algorithm DistortedMetric 

Input: Rooted forest T = {Ti,T 2 } rooted at vertices Xi,x 2 ; weights r e , for all e e 

£(Ti)u£(T 2 ); 

Output: Distance T; 

• [Children] Let y L , z L be the children of x L in T for l = 1, 2 (if x L is a leaf, set 

Zi = Vi = x,)\ 

• [Distance Computations] For all pairs (a, b) G {yi, z{\ x {y 2 , z 2 }, compute 

T>(a, b) := d(a, b) — r(a, xi) — t(6, x 2 ); 

• [Multiple Test] If 

max{2?(ri 1) ,r^ 1) )-2?(rp ) ,r^ 2) ) : 

(r^,rW) e zi} x {y 2 , z 2 }, 6 = 1, 2} = 0, 

return T := T>(z\, z 2 ), otherwise return T := +oo (return T := +oo if any of 
the distances above is +oo). 



Figure 5: Routine DistortedMetric. 

Proposition 7 (Accuracy of DistortedMetric) Let D > 0, W > 5, 7 > 

and g < g' < g*. Consider the Basic Disjoint Setup (General) with T = {T±, T2} 
and Q = {yi, Zi,y2, Z2}. Assume we are given e for all e 6 £{T\) U £ (I2). Let 
T denote the output of DISTORTEDMETRIC in Figure 5. There exists k > 0, such 
that if the following condition holds: 
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• [Edge Length] It holds that r(e) < g', Ve € £{T X ), x € Q 6 ; 

• [Sequence Length] 77ie sequence length is k > k log(n), 
then we have, with probability at least 1 — 0(n~ 7 ), 

T = r(xi,x 2 ) 
under either of the following two conditions: 

1. [Dangling Case] T\ and T2 are dangling and t(T±, T2) < D, or 

2. [Finite Estimate] T < +00. 

Proof: The proof, which is a simple combination of the proof of Proposition 5 and 
the remarks above the statement of Proposition 7, is left out. ■ 

Full Algorithm. The rest of the Blindfolded Cherry Picking algorithm is un- 
changed except for an additional step to compute averaging weights as in the algo- 
rithm of Section 3. This concludes our sketch of the proof of Theorem 3. 



6 For technical reasons explained in [DMR11], we allow edges slightly longer than the upper 
bound g. 
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A The Distance Matrix is Not Sufficient 



A statistic (i.e., a function of the full data) is called sufficient if, conditioned on the 
value of the statistic, the distribution of the full data does not depend on the param- 
eters of the generating model. Roughly speaking, a sufficient statistic encapsulates 
all the information about the data. See e.g. [Was04]. In this section, we show that 
the pairwise correlation matrices do not constitute a sufficient statistic for the full 
Markov model of evolution. Hence, there is in principle more information in the 
full sequence dataset than there is in the matrix of evolutionary distances. 

We give a simple example of non-sufficiency. Consider a four-leaf tree with 
leaf set L = {a, b, c, d} and split ab\cd. Assume we use a CFN model with purines 
denoted "0" and pyrimidines denoted "1" with equal mutation probabilities p. Con- 
sider the following correlation matrices 



4' 



for alH ^ j G L and v\,V2 G {0, 1}. Two different datasets consistent with these 
correlation matrices are 



Datai 



1 1 1 1 

110 10 1 

10 110 10 

1 1 1 1 



and 



Datao 



1 1 1 1 

1 1 1 1 
10 10 10 1 
110 10 1 



where the columns are the sites and the rows are the leaves in the order a, b, c, d. 

We compare the probability of observing the two datasets under two different 
values of p: p = e and = 1/2 — e for e > small. In the first case, in a first 
approximation it suffices to compute the parsimony scores and we have 



and 



P.patai] = (!) + 0( £ 9 ) = ^ + 0( £ 9 ), 
P £ [Data 2 ] = (I)' (|) 2 (^ + 0^) = ^ + 0(e n ). 
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In particular, we get the ratio 

P £ [Data 2 | F] _ P £ [Data 2 ] 
P e [Datai|F] P e [Datai] 



+ 0(e s 



On the other hand, if p = 1/2 — e then the state distribution is almost uniform 
and we get 

P 1/2 _ £ [Data 2 |F] = P 1/2 _ £ [Data 2 ] = ^ 
P 1/2 _ £ [Datai \F] Pi/ 2 _ £ [Datai] £ ' 

Since the ratios are different, we have shown that the distribution of the data condi- 
tioned on the correlation matrices depends on the parameters of the model. There- 
fore, the distance matrix is not a sufficient statistic. 



B Probabilistic Analysis of WPGMA 

Let < / < g < +00 and denote by UY?' 9 the set of all phylogenies T = 
(V,E, [n],p;r) G Y*>» where we have further that r is ultrametric, that is, for all 
v G V it holds that t(v,x) = r(v,y) = t(v), for all leaves x,y below v. This 
is known as the molecular clock assumption, that is, the case where the mutation 
rate is equal on all edges. In that case, there are particularly simple clustering 
algorithms. We recall the WPGMA algorithm in Figure 6. In the molecular clock 
case, it is enough to consider "uncorrected" distances [RS96]. Therefore, we run 
WPGMA with the uncorrected distance estimates 

Tu(a,b) = , 

where 

u(a, b) = u T F ab v, 

for a, 6 G [n\. We call a subset of leaves A a clade if it corresponds to all 
leaf descendants of an internal node a* called the most recent common ancestor 
(MRCA). For a clade A with MRCA a* and a leaf a G A, we let \cl\a = \a\ a * and 
©A = ©a*, a- For disjoint clades A and B, we let 

f u (A B) = £ £ 2-l«U2-H*f u (a, b) = 1 ~ ^ B \ 

a&Ab&B 

where 

u(A, B) = J2J2 2" HA 2 H6|s ^(a, b). 

aeAbeB 
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We define 



w (o,6) = e- r( - a ' b \ 



and 



T u (a, b) = 



1 _ e ~r(a,b) 



2 



And similarly for u(A, B) and r u (^4, B). 

Throughout this section, we use a sequence length k > ftlog(ro) where k is a 
constant to be determined later. 

Algorithm WPGMA 

Input: Distance estimates {f u (a, b)} a be [ n y, 
Output: Tree; 

• Initialization. Let Z be the set of leaves as clusters, that is, 



• Main Loop. For i = 1, . . . , n — 1, 

- Selection Step. Let 

(A*,B*) e argmin{f u (A,_B) : distinct}. 

Merge clusters A*, B* to obtain Zj. 

- Reduction Step. For all CeZ,- {A* U S*}, compute 



• Output. Output tree implied by the successive clusterings Z , . . . , Z n _i. 



Figure 6: Algorithm WPGMA. 

Theorem 5 (Analysis of WPGMA) For all < f < g < g*, WPGMA solves the 
phylogenetic reconstruction problem on UY^' S (g) {Q} with k = O(logn). 

Proof: Fix D > 3g + 2f, 2g + 2f < D < D, and 



Z = {{1} : I G [n]}, 



and for all a, 6 € [n] let 



f u ({a},{6}) = 



f u (a, 6). 



f u (C, A* U B*) = i[f u (C, A*) + r u (C, B*)}. (10) 
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This choice ensures that 



1 + e' > ' 



and / 

D-2g-2f 1 - g/ -, 
1 I / ' 

1 + e 

which will be needed later. Let 

e = mmje'e -15 , e'e~— }, 

and let x be as in Lemma 3 for this choice of e. Taking k large enough, assume the 
conclusion of Lemma 3 holds for all pairs of clades in the tree, an event we denote 
by (*). 

By definition, we have 

t u (A,B)<t u (A',B') ^ u(A,B)>u(A',B'). 

For convenience, in the rest of the proof we work with Cj rather than f u . If A, B 
are disjoint clades with respective MRCA a* and b* satisfying r(a*,b*) < D, we 
have 

w(A,B) < u(A,B) + @ A @ B £ 

< e A e B (e^ a * r) +e'e- T> ) 

= u{A,B){\ + 6>), 

and similarly 

u(A,B) > lo(A,B)(1-s'). 

On the other hand, if r(a*,b*) > D, we have 

cu(A,B) < u(A,B) + @a@b£ 

< Q A Q B {e~ T{a * r) +e'e-Z) 

= e A Q B e^(l + e'). 

By (★) these inequalities hold for all such pairs of clades. 

Two clades A, B are sister clades if their MRCA is their immediate ancestor. 
We use the following convention. Recall that the leaves are denoted {1, . . . , n}. We 
let min A be the smallest label in A. When denoting a pair of sister clades (A, B), 
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we always assume min A < min B. There are n— 1 pairs of sister clades. Order the 
sister pairs by decreasing value of co(A, B), breaking ties by lexicographic order 
over (min A, min B): 

(Ai, Bi), . . . , (A n _i, i?„_i). 

We assume that WPGMA uses the same tie -breaking rule. We let C{ = Ai U B{. 

We prove the following basic claim. For all i = 1, . . . , n — 1, at Selection 
Step i we choose (A*,B*) = (Ai,Bi). The result then follows. We work by 
induction. For i = 0, there is nothing to prove. Assume the claim holds up to some 
1 < i < n — 1. We make a series of observations: 

1. All the current clusters in Zi_\ are clades. This follows from the induc- 
tion hypothesis. By the induction hypothesis, we also get that the values 
fu(^4, B) computed at the Reduction Steps indeed correspond to our origi- 
nal definition: 

f u (A, B) = £ £ 2-l°U2-H*f u (a, b) = i^IMl , 

aeAb&B 

where 

Cj{A, B) = Y j Y. 2" |aU 2-l b l B w(a, b). 

2. We show that for all C G -Zj-i, we have 

uiA^B^e-V <e 2 c < to(Ai, Bi)e 2g+2 f . 

Let C G such that C = A U B for sister clades A, i?. By (*), we have 
that 

@ 2 C = u(A,B) 

> lu(A,B)(1 + e')- 1 

> LoiA^BiXl + e'y 1 

> uj(Ai,Bi)j-^ 

> u{Ai,Bi)e- v . 

Conversely, if a clade C = A U B with sister clades A, B satisfies 

e 2 c = Lo(A,B)>oj(A i ,B l )e 2f , (11) 
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then 



u(A,B) > (l-s')u(A,B) 

> (1 - e'M^, Bi)e 2/ 

> {l + e')Lj(Ai,Bi) 

> u>(Ai,Bi), (12) 

so that C must be included in a cluster of Zi by our induction hypothesis. In 
particular, if two sister clades A, B are such that 6^, 6^ > to(Ai, Bi)e 2g+2 f 
then (11) is satisfied, that is, u(A,B) > u(Ai, Bi)e 2f . By (12), (A,B) 
would have been selected in a previous iteration by induction. That implies, 
for all C G Zi- lf 

e 2 c <u(A l ,B l )e 2g+2 f. 

3. We claim that Ai,B{ G Z^\. Indeed, by the previous paragraph all clades 
with B 2 -value at least uj(Ai, Bi)e 2 ^ have been constructed in a previous 
iteration. In particular, the clade Ai has been constructed in a previous step 
as it satisfies 

Q Ai e-f > @ Ci = y/u(Ai,Bi). 

The same holds for B{. Moreover, Ai and B- L being sister clades of each 
other (and no other clades), they cannot have been selected inside another 
pair by our induction hypothesis. 

4. By construction, (Ai,Bi) is chosen over all other sister clades present in 
Zi-\. So it remains to show that (Ai,Bi) is selected over all other pairs. 
Pairs of clades that are far enough will not be selected. That is, if A, B with 
MRCA a*,b* is such that 

r(a*,b*) > D, 

then 

w(A,B) < 

< 
< 
< 

by assumption on e' . 
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u;(Ai,Bi)e 2 9 +2 fe-^l + e') 
uiA^B^l-e'r'e^e-^l + e') 
Lu(Ai,Bi), 



5. Finally, non-sister clades that are closer than D cannot be selected. Indeed, 
assume by contradiction that (A*,B*) is such a pair. Since (A*,B*) are not 
sister clades, at least one of them, say A* without loss of generality, has an 
immediate ancestor u that is stricly lower than the MRCA of A* and B*. 
Take C* to be any clade in below u that is different than A*. There 
must be such a clade because otherwise A* would have been merged with its 
sister already. The MRCA of A* and C* is u. Moreover, we must have 

9^. >u(Ai,Bi)e- 2f , 

and 

& c .<u>(A i ,B i )e 2 ° +2 f, 

so that 

r(a*,c*) < 2g + g + 2f < 3<? + 2/ < D, 
where a* and c* are the MRCA of A* and C* respectively. Finally by (★) 

lu(A*,C*) > u(A*,C*)(l-e') 

> u(A*,B*)e 2f {l-e') 

> Cj{A\B*){l + e')- l e 2f {l-e') 

> Cj(A*,B*). 

This is a contradiction. 
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